Abstract: We combine a refined version of two-point step-size adaptation with
the covariance matrix adaptation evolution strategy (CMA-ES). Additionally,
we suggest polished formulae for the learning rate of the covariance matrix and
the recombination weights. In contrast to cumulative step-size adaptation or to
the 1/5-th success rule, the refined two-point adaptation (TPA) does not rely on
any internal model of optimality. In contrast to conventional self-adaptation, the
TPA will achieve a better target step-size, in particular with large populations.
The disadvantage of TPA is that it relies on two additional objective function
evaluations.
1 Introduction

In the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [5], two
separate adaptation mechanisms determine the variances and covariances of the
search distribution: one for (overall) step-size control and a second for
adaptation of a covariance matrix. The mechanisms are mainly independent and can
therefore, in principle, be replaced separately. While the standard step-size
control is cumulative step-size adaptation (CSA), a success-based control was
also successfully introduced for the $(1+\lambda)$-CMA-ES in [9].
The CSA has a few drawbacks.

• For very large noise levels the target step-size becomes zero, while the 
optimal step-size is still positive [3]. 

• For large population sizes ($\lambda > 10\,n$) the original parameter setting seemed
not to work properly [5]: the notion of tracking a (long) path history
does not seem to fit well with a large population size (large compared
to the search space dimension). An improved parameter setting introduced
in [5] shortens the backward time horizon for the cumulation and performs
well also with large population sizes [2].

• The expected size of the displacement of the population mean under random
selection is required. To compute a useful measurement independent
of the coordinate system, the principal axes of the search distribution are
needed. They are more expensive to acquire (at least by a constant factor)
than a simple matrix decomposition, which is in any case necessary to
sample a multivariate normal distribution with a given covariance matrix.

• Because the length of an evolution path is compared to its expected length, 
the measurement is sensitive to the specific sample procedure of new can- 
didate solutions and also, for example, to repair mechanisms for solutions. 

Despite these disadvantages, CSA is regarded as the first choice for step-size
control in the $(\mu/\mu_w, \lambda)$-ES, due to its advantages. Nonetheless, the
disadvantages motivate the search for alternatives. Here, we suggest two-point
step-size adaptation (TPA) as one such alternative.

Two-point self-adaptation was introduced for backpropagation in [11] and
later applied in Evolutionary Gradient Search [10]. In evolutionary search,
two-point adaptation resembles self-adaptation on the population level. The
principle is extremely simple: two different step lengths are tested for the mean
displacement and the better one is chosen. In the next section, we integrate a
slightly refined TPA into the CMA-ES and additionally introduce polished
formulae for the recombination weights and the learning rates of the covariance
matrix.
2 The Algorithm: CMA-ES with TPA

Our description of the CMA-ES closely follows [4, 5, 7] and replaces CSA with
TPA. Given an initial mean value $m \in \mathbb{R}^n$, the initial covariance matrix $C = I$
and the initial step-size $\sigma \in \mathbb{R}_+$, the new candidate solutions $x_k$ obey

$$x_k = m + \sigma y_k, \quad \text{for } k = 1, \dots, \lambda, \tag{1}$$

where $y_k \sim \mathcal{N}(0, C)$ denotes the realization of a normally distributed random
vector with zero mean and covariance matrix $C$. The solutions $x_k$ are evaluated
and ranked such that $x_{i:\lambda}$ becomes the $i$-th best solution vector and $y_{i:\lambda}$ the
corresponding random vector realization.
For $\mu \le \lambda$ let

$$\langle y \rangle = \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}, \qquad w_1 \ge \dots \ge w_\mu > 0, \qquad \sum_{i=1}^{\mu} w_i = 1 \tag{2}$$

be the weighted mean of the $\mu$ best ranked $y_k$ vectors. The recombination
weights sum to one. The variance effective selection mass is defined as

$$\mu_w = \frac{1}{\sum_{i=1}^{\mu} w_i^2} \ge 1. \tag{3}$$

From the definition it follows that $1 \le \mu_w \le \mu$, and $\mu_w = \mu$ for equal recombination
weights. The role of $\mu_w$ is analogous to the role of the parent number $\mu$ when the
recombination weights are all equal. Usually $\mu_w \approx \lambda/4$ is appropriate. Weighted
recombination is discussed in more detail in [1].
The default parameter values are

$$\lambda = 4 + \lfloor 3 \ln n \rfloor, \qquad \mu' = \frac{\lambda - 1}{2}, \qquad \mu = [\mu'] \quad \text{and} \tag{4}$$

$$w_i = \frac{\ln(\mu' + 0.5) - \ln i}{\sum_{j=1}^{\mu} \left(\ln(\mu' + 0.5) - \ln j\right)} \quad \text{for } i = 1, \dots, \mu, \tag{5}$$

where $[\mu']$ denotes the integer value closest to $\mu'$, preferably choosing the smaller
integer value in case of a tie, such that $w_\mu > 0$. The first $[0.2\mu']$ weights sum to about
0.5. Conducting restarts with increasing value of $\lambda$ is a valuable option [2].
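The default selection parameters can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function name is hypothetical, $\mu' = (\lambda - 1)/2$ is taken from the discussion in Section 2.4, and the tie-breaking rule for $[\mu']$ is implemented as "round half down".

```python
import math


def default_selection_parameters(n):
    """Return lambda, mu, the recombination weights and mu_w for dimension n."""
    lam = 4 + math.floor(3 * math.log(n))          # Equation (4)
    mu_prime = (lam - 1) / 2                       # assumption: value from Section 2.4
    # Integer closest to mu_prime, taking the smaller one in case of a tie.
    mu = math.ceil(mu_prime - 0.5)
    raw = [math.log(mu_prime + 0.5) - math.log(i) for i in range(1, mu + 1)]
    total = sum(raw)
    weights = [r / total for r in raw]             # Equation (5)
    mu_w = 1.0 / sum(w * w for w in weights)       # Equation (3)
    return lam, mu, weights, mu_w


lam, mu, weights, mu_w = default_selection_parameters(10)
print(lam, mu, [round(w, 3) for w in weights], round(mu_w, 2))
```

For $n = 10$ this gives $\lambda = 10$ and $\mu = 4$, with positive, strictly decreasing weights that sum to one.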

In the remainder, the generation step is completed with the updates of $m$,
$\sigma$, and $C$, where two additional state variables, $\alpha_s \in \mathbb{R}$ and $p_c \in \mathbb{R}^n$, will be
introduced. The method parameters are discussed in Section 2.4.

2.1 The Mean 

The distribution mean is updated according to 

$$m \leftarrow m + \sigma \langle y \rangle. \tag{6}$$

Given $\sigma$ from Equation (1), Equation (6) can also be written as

$$m \leftarrow \sum_{i=1}^{\mu} w_i\, x_{i:\lambda}. \tag{7}$$
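The equivalence of Equations (6) and (7) can be checked numerically. The sketch below is an illustration under simplifying assumptions: it samples with $C = I$ only (the initial covariance matrix), and the function name and argument layout are hypothetical.

```python
import random


def sample_and_recombine(f, m, sigma, lam, weights, rng):
    """One sampling step with C = I; returns the new mean in both forms."""
    n = len(m)
    ys = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(lam)]
    xs = [[m[j] + sigma * y[j] for j in range(n)] for y in ys]      # Equation (1)
    order = sorted(range(lam), key=lambda k: f(xs[k]))              # best first
    best = order[:len(weights)]
    # Weighted mean <y> of the mu best steps, Equation (2).
    mean_y = [sum(w * ys[k][j] for w, k in zip(weights, best)) for j in range(n)]
    m_eq6 = [m[j] + sigma * mean_y[j] for j in range(n)]            # Equation (6)
    m_eq7 = [sum(w * xs[k][j] for w, k in zip(weights, best))       # Equation (7)
             for j in range(n)]
    return m_eq6, m_eq7, mean_y


def sphere(x):
    return sum(v * v for v in x)


m6, m7, _ = sample_and_recombine(sphere, [1.0, -2.0, 0.5], 0.3, 10,
                                 [0.5, 0.3, 0.2], random.Random(1))
print(max(abs(a - b) for a, b in zip(m6, m7)))
```

The two results agree up to rounding because the weights sum to one and the same $\sigma$ is used for sampling and for the mean update.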
2.2 Step-Size Control: Two-Point Adaptation (TPA)

A two-point self-adaptive scheme is implemented based on [10]. We compute
two additional function evaluations

$$f_+ = f(m + \alpha' \sigma \langle y \rangle) \tag{8}$$
$$f_- = f(m - \alpha' \sigma \langle y \rangle), \tag{9}$$

where $f$ is the objective function to be minimized, $m$ is the new (updated)
mean value, and $\alpha' \approx 0.5$ is the test width parameter. The factor $\pm\alpha'$ in the
equations is chosen symmetrical about the new mean $m$.

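The step-size should increase if $f_+$ is better than $f_-$, and decrease otherwise.
Using the values $f_+$ and $f_-$ we set

$$\alpha_{\mathrm{act}} = \begin{cases} -\alpha + \beta < 0, & \text{if } f_- \text{ is better (smaller) than } f_+, \\ \alpha > 0, & \text{otherwise.} \end{cases} \tag{10}$$

Initializing $\alpha_s = 0$, the new step-size is calculated according to

$$\alpha_s \leftarrow \alpha_s + c_\alpha (\alpha_{\mathrm{act}} - \alpha_s) = (1 - c_\alpha)\,\alpha_s + c_\alpha\,\alpha_{\mathrm{act}} \tag{11}$$
$$\sigma \leftarrow \sigma \times \exp(\alpha_s), \tag{12}$$

where $1/c_\alpha \ge 1$ determines the backward time horizon for smoothing the step-size
changes in the generation sequence. The default parameter settings are

$$\alpha' = 0.5, \quad \alpha = 0.5, \quad \beta = 0, \quad c_\alpha = 0.3. \tag{13}$$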
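Equations (8) to (12) translate almost line by line into code. The following is a minimal sketch, not the authors' implementation: the function and argument names are hypothetical (`alpha_test` stands for $\alpha'$), vectors are plain Python lists, and `m` is assumed to be the already updated mean.

```python
import math


def tpa_step_size_update(f, m, mean_y, sigma, alpha_s,
                         alpha_test=0.5, alpha=0.5, beta=0.0, c_alpha=0.3):
    """Return the updated (sigma, alpha_s) after one two-point test."""
    x_plus = [mi + alpha_test * sigma * yi for mi, yi in zip(m, mean_y)]
    x_minus = [mi - alpha_test * sigma * yi for mi, yi in zip(m, mean_y)]
    f_plus, f_minus = f(x_plus), f(x_minus)                   # Equations (8), (9)
    if f_minus < f_plus:
        alpha_act = -alpha + beta     # the shorter step is better: decrease
    else:
        alpha_act = alpha             # otherwise (including ties): increase
    alpha_s = (1 - c_alpha) * alpha_s + c_alpha * alpha_act   # Equation (11)
    sigma = sigma * math.exp(alpha_s)                         # Equation (12)
    return sigma, alpha_s


def sphere(x):
    return sum(v * v for v in x)


# The mean step points toward the optimum, so the longer test step wins.
print(tpa_step_size_update(sphere, [1.0, 0.0], [-1.0, 0.0], 0.1, 0.0))
```

With the default settings, a single successful test of the longer step multiplies $\sigma$ by $\exp(0.15)$, because the smoothing lets only the fraction $c_\alpha = 0.3$ of $\alpha = 0.5$ through in the first generation.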



Comparison to the previous formulation  The two-point step-size adaptation
described here differs from [10] in that the test steps are distinguished
from the step-size changes by using (i) a symmetrical test step about the new
$m$, (ii) different test width and change parameters and (iii) a smoothing for
the step-size change. Furthermore, the original step-size is used for updating
$m$. Setting $\alpha' = 0.8$, $\alpha = \ln(1.8) \approx 0.588$, $\beta = 0$, $c_\alpha = 1$, replacing $-\alpha'$ with
$-\alpha'/(1 + \alpha')$ in Equation (9) and using the new step-size for finally updating
the mean $m$ recovers the step-size adaptation from [10]. In most cases, we do not
expect an essentially different behavior due to our refinements.

Step-size changes are essentially multiplicative. A factor $\exp(\pm\alpha)$ can be
used to realize changes of $\sigma$, which is symmetrical about 1 in the log scale.
On the other hand, using such factors for generating test steps extends the
step further by $\exp(+\alpha) > 1$ than it reduces it by $\exp(-\alpha) < 1$. Assume the
most simple spherical objective function model and optimal step-size, where
$f(m + a \langle y \rangle)$ about the new mean $m$ is minimal for $a = 0$ and

$$f(m + a \langle y \rangle) = f(m - a \langle y \rangle).$$

Then a larger test step

$$f(m_{\mathrm{old}} + \exp(\alpha'')\,\sigma \langle y \rangle) = f(m + \alpha' \sigma \langle y \rangle),$$

given $\alpha'' = \ln(1 + \alpha')$, is disfavored compared to

$$f(m_{\mathrm{old}} + \exp(-\alpha'')\,\sigma \langle y \rangle) = f\!\left(m - \frac{\alpha'}{1 + \alpha'}\,\sigma \langle y \rangle\right).$$

The step-size will systematically decrease; the target step-size is smaller than
the optimal step-size. On simple functions, like the sphere model, this effect
might well lead to a performance improvement, because the optimum can be
approached quickly and therefore the optimal step-size decreases fast. The
suboptimal target step-size "anticipates" this change. Nevertheless, in general, we
tend to favor an agreement of target and optimal step-size and therefore we are
in favor of symmetrical test steps.¹
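The asymmetry argument can be made concrete with a one-dimensional stand-in for the spherical model. In this sketch the function name is hypothetical, and the model $f(a) = a^2$ (with $a$ the displacement from the new mean along $\langle y \rangle$) is an assumption that encodes "the step-size was optimal".

```python
import math


def compare_test_steps(alpha_test, sigma=1.0, step=1.0):
    """Return f-values of symmetric and of multiplicative test steps."""
    def f(a):
        return a * a                  # minimal at the new mean (a = 0)

    alpha2 = math.log(1.0 + alpha_test)                  # alpha'' = ln(1 + alpha')
    # Displacements relative to the new mean m = m_old + sigma * step.
    longer = (math.exp(alpha2) - 1.0) * sigma * step     # +alpha' * sigma * step
    shorter = (math.exp(-alpha2) - 1.0) * sigma * step   # -alpha'/(1+alpha') * sigma * step
    symmetric = (f(alpha_test * sigma * step), f(-alpha_test * sigma * step))
    multiplicative = (f(longer), f(shorter))
    return symmetric, multiplicative


symmetric, multiplicative = compare_test_steps(0.8)
print(symmetric, multiplicative)
```

For $\alpha' = 0.8$ the symmetric test yields a tie, whereas the multiplicative test compares a displacement of $0.8$ with one of about $-0.44$, so the shorter step always wins and the step-size is driven down even though it was optimal.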

Following [10], the update of $m$ in Equation (6) could be postponed until
after the step-size is updated in Equation (12) (Equations (8) and (9) must
be revised accordingly using the old mean value). Whether or not this results
in a better $m$ cannot be decided without additional costs, because neither the
original step-size nor the updated step-size is usually tested. Furthermore,
Equation (7) would not hold anymore. Empirically, using the new step-size leads
to slightly higher convergence rates in norm optimization (sphere function) in
small dimensions.

2.3 Covariance Matrix Adaptation (CMA) 

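The covariance matrix admits a rank-one and a rank-$\mu$ update. For the rank-one
update an evolution path $p_c$ is constructed.

$$p_c \leftarrow (1 - c_c)\, p_c + h_\sigma \sqrt{c_c (2 - c_c)\, \mu_w}\; \langle y \rangle \tag{14}$$

$$C \leftarrow (1 - c_1 - c_\mu)\, C + \underbrace{c_1\, p_c p_c^{\mathsf T}}_{\text{rank-one update}} + \underbrace{c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda} y_{i:\lambda}^{\mathsf T}}_{\text{rank-}\mu\text{ update}} \tag{15}$$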

where $h_\sigma = 0$ if $\alpha_s > (1 - (1 - c_\alpha)^{g})(1 - (1 - c_\alpha)^{g})\,\alpha$, and 1 otherwise, where
$g$ is the generation counter. The update of $p_c$ is stalled when $\alpha_s$ is large. The
stall is decisive after a change in the environment which demands a significant
increase of the step-size. Fast changes of the distribution shape are postponed
until after the step-size has increased to a reasonable value.

For the covariance matrix update, the cumulation in (14) serves to capture
dependencies between consecutive steps. Dependency information would be lost
for $c_c = 1$, because a change in sign of $p_c$ or $y_{i:\lambda}$ does not matter in (15).
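Equations (14) and (15) can be sketched with plain nested lists. The function below is an illustration only: its name and signature are hypothetical, the learning rates are passed in explicitly rather than derived from defaults, and the stall indicator $h_\sigma$ is taken as an argument instead of being computed.

```python
import math


def covariance_update(C, p_c, ys_best, weights, c_c, c_1, c_mu, h_sigma=1.0):
    """Return (C_new, p_c_new); ys_best holds the mu best steps, best first."""
    n = len(p_c)
    mu_w = 1.0 / sum(w * w for w in weights)                       # Equation (3)
    mean_y = [sum(w * y[j] for w, y in zip(weights, ys_best)) for j in range(n)]
    factor = h_sigma * math.sqrt(c_c * (2.0 - c_c) * mu_w)
    p_c = [(1.0 - c_c) * p_c[j] + factor * mean_y[j] for j in range(n)]   # (14)
    C_new = [[(1.0 - c_1 - c_mu) * C[r][s]
              + c_1 * p_c[r] * p_c[s]                              # rank-one update
              + c_mu * sum(w * y[r] * y[s]                         # rank-mu update
                           for w, y in zip(weights, ys_best))
              for s in range(n)] for r in range(n)]                # Equation (15)
    return C_new, p_c


identity = [[1.0, 0.0], [0.0, 1.0]]
# Illustrative learning rates, not the defaults of Equations (16) and (17).
C_a, _ = covariance_update(identity, [0.0, 0.0], [[1.0, 0.0]], [1.0], 1.0, 0.1, 0.2)
C_b, _ = covariance_update(identity, [0.0, 0.0], [[-1.0, 0.0]], [1.0], 1.0, 0.1, 0.2)
print(C_a, C_b)
```

With $c_c = 1$ the two calls give the same matrix although the selected step has opposite sign, which is the loss of dependency information mentioned above.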

The default parameter settings are

$$c_c = \frac{4}{n + 4}, \qquad \mu_{\mathrm{cov}} = \mu_w, \tag{16}$$

$$c_1 = \frac{2}{(n + 1.3)^2 + \mu_{\mathrm{cov}}}, \qquad c_\mu = \text{[illegible]}. \tag{17}$$

¹ Good algorithm design must at times prefer the reasonable to the optimal performance
in order to avoid overfitting to specific test scenarios.
2.4 Discussion of Parameters 

The default values for all parameters, namely offspring population size $\lambda$,
recombination weights $w_{i=1,\dots,\mu}$, cumulation parameter $c_c$, mixing number $\mu_{\mathrm{cov}}$,
and learning rates $c_1$ and $c_\mu$, follow [4, 5, 7] and were given above, as were
the step-size parameters test width $\alpha'$, changing factor $\alpha$, update bias $\beta$ and
smoothing parameter $c_\alpha$. The changes of parameters compared to [4, 5, 7] are
minor polishings. We discuss some settings in detail.

Recombination weights  Compared to [4, 7], where $\mu' = [(\lambda - 1)/2]$, we
have chosen $\mu' = (\lambda - 1)/2$. The small difference occurs only for even
$\lambda$. In the former version, given odd population size $\lambda$, the recombination
weights did not change when $\lambda$ was reduced by one. In the present version
the recombination weights always adjust to changes of $\lambda$.

$c_1$ and $c_\mu$ are the learning rates for the rank-one and rank-$\mu$ update of the
covariance matrix, respectively. In [4, 5, 6, 7], a learning rate $c_{\mathrm{cov}} \approx c_1 + c_\mu$
is used such that $c_1 \approx c_{\mathrm{cov}}/\mu_{\mathrm{cov}}$ and $c_\mu \approx c_{\mathrm{cov}} (\mu_{\mathrm{cov}} - 1)/\mu_{\mathrm{cov}}$. In the former
formulation, $c_1$ was almost two times smaller for values of $\mu_{\mathrm{cov}} \approx 2$ than
for $\mu_{\mathrm{cov}} = 1$ and did not monotonically decrease with larger $\mu_{\mathrm{cov}}$.

$c_\alpha$ determines the smoothing of $\alpha_s$. Smoothing and choosing $\alpha$ small (damping)
suppress stochastic fluctuations of $\sigma$. In contrast to choosing $\alpha$ small,
smoothing does not affect the maximal possible change rate for $\sigma$. For
$c_\alpha \ge 0.5$ we find $\alpha_{\mathrm{act}}\,\alpha_s > 0$: the signs of the recent measurement and the
actual change always agree and the smoothing cannot lead to oscillations.
For $c_\alpha \ge 0.3$ we always have $\alpha_{\mathrm{act}}\,\alpha_s > 0$ only after a second agreeing
measurement for $\alpha_{\mathrm{act}}$. Even smaller values for $c_\alpha$ might be useful, but for much
smaller values, presumably $\alpha$ must be chosen more carefully (smaller).

$\beta$ is the bias parameter for the step-size change. On potentially noisy or highly
rugged functions $\beta$ should be set to $0.2\,\alpha$, which results in an effective noise
handling.
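The sign statements made above for the smoothing parameter $c_\alpha$ can be checked by brute force over all short sequences of measurements. The sketch assumes $\beta = 0$ and $\alpha_s = 0$ initially; the function name and the `needed_repeats` argument are hypothetical.

```python
from itertools import product


def sign_always_agrees(c_alpha, length, alpha=0.5, needed_repeats=1):
    """Check alpha_act * alpha_s > 0 over all sign sequences of given length.

    With needed_repeats=2 the condition is only required when the current
    measurement agrees with the previous one.
    """
    for signs in product((-1.0, 1.0), repeat=length):
        alpha_s = 0.0
        for t, s in enumerate(signs):
            alpha_act = s * alpha                                     # beta = 0
            alpha_s = (1 - c_alpha) * alpha_s + c_alpha * alpha_act   # Equation (11)
            repeated = t + 1 >= needed_repeats and all(
                signs[t - k] == s for k in range(needed_repeats))
            if repeated and alpha_act * alpha_s <= 0:
                return False
    return True


print(sign_always_agrees(0.5, 10))                      # every measurement
print(sign_always_agrees(0.3, 10))                      # every measurement
print(sign_always_agrees(0.3, 10, needed_repeats=2))    # second agreeing measurement
```

The recursion shows why the thresholds sit where they do: one measurement outweighs any history if $1 - c_\alpha \le 0.5$, and two agreeing measurements do so if $(1 - c_\alpha)^2 < 0.5$, which holds for $c_\alpha = 0.3$ since $0.7^2 = 0.49$.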

3 Empirical Validation 

In empirical investigations of the TPA-CMA-ES, we find the expected, feasible 
behavior. The comparison with CSA shows no clear winner. Depending on 
the objective function either TPA or CSA is faster, but the factor is seldom 
larger than two. Surprisingly, in our exploratory simulations, there is no clear 
winner depending on dimension or population size or noise. On noisy functions, 
setting $\beta = 0.2\,\alpha = 0.1$ for TPA is quite effective, while we observe only a
minor effect from this change otherwise. We did not extensively try to exploit
potential weaknesses (as has been done for CSA), but we suspect that the TPA
is a feasible and robust alternative to CSA.

4 Conclusion and Outlook 

We see some principal advantages of using two-point step-size adaptation
(TPA) in the CMA-ES.

• The TPA does not rely on a predefined optimality condition, like a success 
rate of 1/5 or conjugate-perpendicularity of consecutive steps. 

• The TPA does not rely on specific properties of the sample distribution
or the selection of solutions. Therefore, it is presumably less sensitive to
any modifications of the underlying algorithm, in particular compared to
CSA.

• The step-size change rate can be adjusted mainly independently from 
TPA-internal considerations. Time averaging or damping are not essen- 
tially necessary. 

Even so, we see two principal disadvantages of TPA.

• Two additional function evaluations are needed per iteration step. This
is not a grave disadvantage, in particular when the population size is not
very small. As a possible remedy, these two points could be incorporated
in the population and used to compute the (final) mean in Equation (7),
and one of them might be used in the rank-$\mu$ update of the covariance
matrix.

• Step-size control is based on two objective function evaluations only.
Selection information from the remaining population (and history information)
is somewhat disregarded. This is a conceptual defect that might be
irrelevant in practice.

In conclusion, two-point step-size adaptation is an alternative to cumulative
step-size adaptation that is well worth further exploration. Whether and when it
should finally replace CSA in practice must be answered in future empirical
studies.